 weakly-supervised semantic segmentation



Causal Intervention for Weakly-Supervised Semantic Segmentation

Neural Information Processing Systems

We present a causal inference framework to improve Weakly-Supervised Semantic Segmentation (WSSS). Specifically, we aim to generate better pixel-level pseudo-masks using only image-level labels, the most crucial step in WSSS. We attribute the ambiguous boundaries of pseudo-masks to the confounding context: for example, the correct image-level classification of "horse" and "person" may be due not only to the recognition of each instance but also to their co-occurrence context, which makes it hard for model inspection (e.g., CAM) to distinguish the boundaries between them. Inspired by this, we propose a structural causal model to analyze the causalities among images, contexts, and class labels. Based on it, we develop a new method, Context Adjustment (CONTA), which removes the confounding bias in image-level classification and thus provides better pseudo-masks as ground truth for the subsequent segmentation model. On PASCAL VOC 2012 and MS-COCO, we show that CONTA boosts various popular WSSS methods to new state-of-the-art results.



Weakly-Supervised Audio-Visual Segmentation

Neural Information Processing Systems

Audio-visual segmentation is a challenging task that aims to predict pixel-level masks for sound sources in a video. Previous work applied a comprehensive, manually designed architecture trained with a large number of pixel-accurate masks as supervision. However, such pixel-level masks are expensive to annotate and are not available in all cases.




Review for NeurIPS paper: Causal Intervention for Weakly-Supervised Semantic Segmentation

Neural Information Processing Systems

Weaknesses:
1. Questions about the structural causal model
   1) I feel that the confounder set C can be interpreted as "object shapes and where to place them". But I still do not have an intuitive way to interpret the image-specific context representation M.
   2) Why is X → M instead of M → X? From my understanding, we sample object shapes and their locations to get M. And then later we sample object appearance (e.g., texture, lighting, etc.) to get X.
2. Implementation
   1) Since the images in both VOC and COCO have different sizes and ratios, I wonder how the authors construct the confounder set C.
   2) Is the segmentation mask X_m (L195) logits or probabilities?
   3) I feel a bit confused about Eqn. It seems that W_1 and W_2 are used as projection matrices, reducing the dimension from the original spatial size (hw) to the number of classes (n). I wonder if this is reasonable.
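The shape question in the last implementation point can be made concrete with a small sketch. Only the symbols W_1, W_2, X_m, C, hw and n come from the review; the attention-style weighting, the uniform prior over confounders, and the function name context_representation are assumptions made for illustration, not the paper's confirmed equation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def context_representation(x_m, confounders, w1, w2):
    """Hypothetical attention-weighted mix of confounders.

    x_m:         (hw,)   flattened segmentation mask
    confounders: (n, hw) one flattened entry c_i per class
    w1, w2:      (n, hw) projections from spatial size hw down to n
    """
    n = confounders.shape[0]
    query = w1 @ x_m                   # (n,)   projected mask
    keys = confounders @ w2.T          # (n, n) row i is w2 @ c_i
    alpha = softmax(keys @ query / np.sqrt(n))  # (n,) one weight per c_i
    prior = np.full(n, 1.0 / n)        # assumed uniform P(c_i)
    return (alpha * prior) @ confounders        # (hw,)

# Toy sizes: a 4x4 mask (hw = 16) and n = 3 classes.
rng = np.random.default_rng(0)
hw, n = 16, 3
x_m = rng.random(hw)
confounders = rng.random((n, hw))
w1 = rng.standard_normal((n, hw))
w2 = rng.standard_normal((n, hw))
m = context_representation(x_m, confounders, w1, w2)
```

Under these assumptions the projections compress each hw-dimensional map to n numbers before the similarity is computed, which is the step the reviewer questions: all spatial structure has to be captured by the learned rows of W_1 and W_2.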


Review for NeurIPS paper: Causal Intervention for Weakly-Supervised Semantic Segmentation

Neural Information Processing Systems

This paper proposes using a causal inference framework for weakly supervised semantic segmentation. It corrects mistakes in pseudo-masks by adjusting for confounding effects. By relying on causal inference, as opposed to a purely discriminative model, the goal is to avoid relying on spurious correlations in the training data that might fail to generalize. The reviewers agree that using backdoor adjustments for semantic segmentation is a novel use of the technique, and that the experimental results are impressive. One suggestion for improvement in the camera-ready version is to state the modeling assumptions of the causal framework more clearly, and to elaborate on their implications for this problem.
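For reference, the backdoor adjustment mentioned above has the following textbook form when a set of variables C satisfies the backdoor criterion for the effect of X on Y. The notation is the standard one; how the paper instantiates each term is not stated in this summary.

```latex
P(Y \mid do(X)) = \sum_{c} P(Y \mid X, c)\, P(c)
```

The contrast is with the observational quantity $P(Y \mid X) = \sum_{c} P(Y \mid X, c)\, P(c \mid X)$: the adjustment replaces $P(c \mid X)$ with the marginal $P(c)$, so every context contributes according to its overall prevalence rather than its correlation with the image. Applying the formula requires that the values of C can be enumerated or approximated, which is one of the modeling assumptions the suggestion above asks to be made explicit.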


Expansion and Shrinkage of Localization for Weakly-Supervised Semantic Segmentation

Neural Information Processing Systems

Generating precise class-aware pseudo ground-truths, a.k.a. class activation maps (CAMs), is essential for Weakly-Supervised Semantic Segmentation. The original CAM method usually produces incomplete and inaccurate localization maps. To tackle this issue, this paper proposes an Expansion and Shrinkage scheme based on offset learning in the deformable convolution, which sequentially improves the recall and precision of the located object in two respective stages. In the Expansion stage, an offset learning branch in a deformable convolution layer, referred to as the "expansion sampler", seeks to sample increasingly less discriminative object regions, driven by an inverse supervision signal that maximizes the image-level classification loss. The more complete object region located in the Expansion stage is then gradually narrowed down to the final object region during the Shrinkage stage.
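As background for the "original CAM method" mentioned above, a class activation map is the sum of the final convolutional feature maps weighted by the classifier weights of the target class. The sketch below shows only that baseline, not the Expansion and Shrinkage scheme; the function name, the toy shapes, and the clipping and max-normalization steps (a common convention in WSSS pipelines) are assumptions.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Baseline CAM for one class.

    features:   (K, H, W) feature maps from the last conv layer
    fc_weights: (num_classes, K) weights of the linear classifier
                applied after global average pooling
    """
    # Weighted sum over the K channels using the target class's weights.
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))  # (H, W)
    cam = np.maximum(cam, 0.0)  # assumed convention: keep positive evidence only
    peak = cam.max()
    # Scale to [0, 1]; an all-zero map is returned unchanged.
    return cam / peak if peak > 0 else cam

# Toy input: 8 channels on a 4x4 grid, 3 classes.
rng = np.random.default_rng(0)
features = rng.random((8, 4, 4))
fc_weights = rng.standard_normal((3, 8))
cam = class_activation_map(features, fc_weights, class_idx=1)
```

Because the classifier is trained only to decide whether a class is present, the largest weights tend to fall on the most discriminative parts of an object, which is why the resulting maps are often incomplete.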

